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Abstract We show that the exact worst-case performance of fixed-step first-order methods for unconstrained 
optimization of smooth (possibly strongly) convex functions can be obtained by solving convex programs. 

Finding the worst-case performance of a black-box first-order method is formulated as an optimization 
problem over a set of smooth (strongly) convex functions and initial conditions. We develop closed-form 
necessary and sufficient conditions for smooth (strongly) convex interpolation, which provide a finite repre¬ 
sentation for those functions. This allows us to reformulate the worst-case performance estimation problem 
as an equivalent finite dimension-independent semidefinite optimization problem, whose exact solution can 
be recovered up to numerical precision. Optimal solutions to this performance estimation problem provide 
both worst-case performance bounds and explicit functions matching them, as our smooth (strongly) convex 
interpolation procedure is constructive. 

Our works build on those of Drori and Teboulle in [Math. Prog. 145 (1-2), 2014] who introduced and 
solved relaxations of the performance estimation problem for smooth convex functions. 

We apply our approach to different fixed-step first-order methods with several performance criteria, 
including objective function accuracy and gradient norm. We conjecture several numerically supported worst- 
case bounds on the performance of the fixed-step gradient, fast gradient and optimized gradient methods, 
both in the smooth convex and the smooth strongly convex cases, and deduce tight estimates of the optimal 
step size for the gradient method. 


1 Introduction to performance estimation 

Consider the standard unconstrained minimization problem 

min f(x), 

x€R d 

where / is a smooth convex function, possibly strongly convex. First-order black-box methods, which only 
rely on the computation of / and its gradient at a sequence of iterates, can be designed to solve this type 
of problem iteratively. A central question is then to estimate the accuracy of solutions computed by such 
a method. More precisely, given a class of problems and a first-order method, one wishes to establish the 
worst-case accuracy of solutions that can obtained after applying a given number of iterations, i.e., the 
performance of the method on the given class of problems. 
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Many first-order algorithms have been proposed in the literature for smooth convex or smooth strongly 
convex functions, for which one usually provides a theoretical upper bound on the worst-case accuracy after 
a number of iterations (see e.g., [213 or [BJ Chap. 6 ] for recent overviews). However, many analyses focus on 
the asymptotic rate of convergence of these bounds, rather than trying to compute exact numerical values. 
Similarly, lower bounds on the performance of first-order black-box methods on given classes of problems 
can be found in the literature (see e.g., the seminal [T9]), again often with a focus on asymptotic rates of 
convergence. In many situations, the asymptotic rate of convergence of the best available methods match 
those lower bounds. 

Nevertheless, the exact numerical value of the worst-case performance of a given method is usually 
unknown. This is because upper bounds are not assessed precisely, i.e., are known only up to a (possibly 
unspecified) constant. Another reason is that lower bounds for specific methods are not very frequently 
developed, and that general lower bounds (valid for all methods) can be quite weak for specific methods, 
especially if those methods do not feature the best possible asymptotic rate of convergence. Finally, even if 
exact numerical values are known for both lower and upper bounds, and share the same (optimal) asymptotic 
rate of convergence, a significant gap between the numerical values of those lower and upper bounds can 
subsist. If one cares about the worst-case efficiency of a first-order method in practice, this gap can translate 
into a very large uncertainty on the concrete behavior of a method. 

This work is not concerned with asymptotic rates of convergence. It will focus on the computation of 
the exact worst-case performance of a given first-order black-box method, on a given class of functions, 
after a given number of iterations. We prove that this question can be formulated and solved exactly as a 
(finite-dimensional) convex optimization problem, with the following attractive features: 

— Our formulation is a semidefinite optimization problem whose dimension is proportional to the square of 
the number of iterations of the method to be analyzed. 

— Any dual feasible solution of our formulation provides an upper bound on the worst-case performance. 
This solution can be easily converted into a standard proof establishing a bound on the performance 
(i.e., a series of valid inequalities). 

— Any primal feasible solution of our formulation provides a lower bound on the worst case performance. 
This solution can be easily converted into a concrete function on which the method exhibits the corre¬ 
sponding performance. 

— Hence our formulation is exact, i.e., its optimal value provides the exact worst-case performance. 

Our formulation covers both smooth convex functions and smooth strongly convex functions in a unified 
fashion. It covers a very large class of first-order methods which includes the majority of standard methods 
for smooth unconstrained convex optimization. It can be applied to a variety of performance measures, such 
as objective function accuracy, gradient norm, or distance to an optimal solution. 


1.1 Formal definition 

Our goal is to express the worst-case performance of an optimization algorithm as the solution of an op¬ 
timization problem. This approach was pioneered by Drori and Teboulle m, who called it a performance 
estimation problem (PEP). We now provide a formal definition for this problem. 

We consider unconstrained minimization problems involving a given class of objective functions, and 
only treat first-order black-box methods. This means that the method can only gather information about 
the objective function using an oracle Of, which returns first-order information about specific points, i.e., 
Of(x ) = {/(a;), V/(a:)}. Formally, the first N iterates generated by a first-order black-box method M (which 
correspond to N calls of the oracle), starting from an initial point xq, can be described with 

xi = Mi (x 0 ,Of{x 0 )), 

x 2 = M-2 {x 0 ,Of(x 0 ),Of(x 1 )), 

(1) 


Xl V = Mn (zo, Of(x o), . . . , Of{XN-l)) ■ 
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In order to measure the performance of a given method M on a specific function / with a specific starting 
point, we introduce a performance criterion V to be minimized, that will only depend on the function / and 
the sequence of the iterates {.To, xi, ..., xn} generated by the method. Since we are in a black-box setting, 
we require that the criterion can be computed from the output of the oracle Of, which has only access to 
the iterates as well as to an additional point x* , defined to be any minimizer of function / (the latter being 
necessary if the criterion has to compare iterates to an optimal solution). 

Examples of this performance criterion V(Of, xo, ■ ■ ■, xjv, x*) include the objective function accuracy 
/(xjv) — /(x*), the norm of the gradient ||V/(xjv)||, or the distance to an optimal solution ||xjv — x*|| (see 
also Section 4.3 for an example of criterion that does not only depend on the last iterate). 

Finally, we consider a given class T of smooth convex or smooth strongly convex functions, over which 
we wish to estimate the worst-case performance of a method after N iterations. Note that the dimension 
of functions belonging to class T is left unspecified; in particular we will allow class T to contain functions 
with varying dimensions, in order to obtain dimension-free results. 

As methods try to minimize the performance criterion, their worst-case performance is obtained by 
maximizing V over functions in T , which can be written as 


w(J r ,R,M,N,'P) = sup V(Of,xo,...,XN,x*) (PEP) 

f ,X 0 ,...,X N ,X, 

such that / G T 

x* is optimal for /, 

xi, ..., xjv is generated from xo by method M with Of, 

11To - X*|| 2 < R. 


Parameter R was introduced to bound the distance between the initial point xo and the optimal solution 
x*. Indeed, it is well-known that in most situations, performance of a first-order method cannot be sensibly 
assessed without such a constraint (see also the discussion of Section 3.51. 


1.2 Finite-dimensional reformulation using interpolation 


Because it involves an unknown function / as a variable, problem (PEP) is infinite-dimensional. Nevertheless, 
using the black-box property of the method (and of the performance criterion), we will show that a completely 
equivalent finite-dimensional problem can readily be formulated by restricting the variable / to the knowledge 
of the output of its oracle Of on the iterates {xo,xi,... , xjv} and x*. Indeed, denoting the output of the 
oracle at each iterate Xi by Of{xi) = {fi,gi}, method M defined by (jT|) can be equivalently rewritten as 


xi = Mi (x 0 , fo, go ), 
x 2 = M 2 (xo,fo,go,fi,gi), 

( 2 ) 


xn = Mn (xo, /o, <7o, ■ • •, /jv-i, <?jv-i) ■ 


Now, defining a set I = (0,1,2,..., TV", *} for the indices of the iterates, we can reformulate (PEP I into 
a problem involving only the iterates {xijigj, their function values {/i}ig/ and their gradients {gi}i£i as 
(using equivalence between optimality of x* and constraint g * = 0, as our problem is unconstrained) 


w f (R,R,M,N,V)= sup V({xi,gi,fi} ieI ), (f-PEP) 

such that there exists / G T such that Of(xt) = {fi,gi} Vi G I, 

g* = 0, 

xi, ..., xjv is generated from xo by method M with {/j, gi} ie f 0 N _ 1 }, 
||xo - x*|| 2 < R■ 
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The crucial part of this reformulation is the first constraint, which can be understood as requiring that the 
set of variables {xi, gi, fi} ieI can be interpolated by a function belonging to the class T. This optimization 
problem is strictly equivalent to the original (PEP I in terms of optimal value, since every solution to (f-PEP ) 
can be interpolated by a solution of (PEP I and, reciprocally, every solution of (PEP I can be discretized to 
provide a solution to (f-PEP I. From that it is clear that w(T ', R , A4, N, V) = w 1 (J 7 , R , A4, N, V). 


1.3 Paper organization and main contributions 


We focus on the class of smooth (strongly) convex functions. Therefore an exact formulation of problem 
(PEP) as (f-PEP) will require a set of necessary and sufficient conditions for the existence of a smooth 


strongly convex interpolating function, which is the main result obtained in Section [2] This set of conditions, 
which is of independent interest, was previously only known for general nonsmooth convex functions. Our 
approach is fully constructive, as we also exhibit a procedure to interpolate a smooth (strongly) convex 
function from a set of points with their associated gradients and function values, when such an interpolating 
function exists. 

In Section [3j we show how the resulting finite-dimensional (f-PEP) problem can be reformulated exactly 
into a (convex) semidefinite optimization problem, which provides the first tractable and provably exact 
formulation of the performance estimation problem. We allow consideration of both smooth convex and 
smooth strongly convex functions, as well as a large class of performance criteria. 

Section [4] then tests our approach numerically on several standard first-order methods, including the 
constant-step gradient method, the fast gradient method and the optimized gradient method from p3]. We 
are able to confirm several bounds appearing previously in m, and to conjecture several new worst-case 
performance bounds, including bounds for strongly convex functions, and bounds on the gradient norm 
(either for the final iterate, or the smallest norm among all iterates). Another byproduct of our results is 
a tight estimate of the optimal step size for the gradient method on smooth convex and smooth strongly 
convex functions. 


1.4 Prior work 


Drori and Teboulle [10] were first to consider the notion of a performance estimation problem. They focus 
exclusively on the case of smooth convex functions equipped with the performance criterion /(*jv) — /*, and 
introduce the idea of reducing (PEP) to a finite-dimensional problem involving only the iterates x t . their 
gradients and function values /i, along with an optimal point x * and optimal value /*. They treat several 
standard first-order algorithms, namely, the standard fixed-step gradient algorithm, the heavy-ball method 
and the accelerated gradient method [20]. In their approach, ( PEP| ) is expressed as a non-convex quadratic 
matrix program [2], which is then relaxed and dualized. The resulting convex problem is then used to provide 
bounds on the worst-case performance (and, in some cases, is solved analytically). As will be shown later in 
this paper (see Section [4|, because of the use of a relaxation and the dualization of a non-convex problem, 
these bounds are in general not tight, although they are in many special cases. 

A Section in udi is also devoted to the optimization of the coefficients of a fixed-step first-order black¬ 
box method. More precisely, a numerical optimization solver is used to identify a method performing best 
according to their relaxation of the performance estimation problem. This approach is taken further in iT3], 
which provides an analytical description of this optimized method. Again we stress that, due to the non¬ 
tightness of the relaxation in general, these optimized methods are not guaranteed to have the best possible 
performances. 

Another computational approach for the analysis and design of first-order algorithms is proposed in US], 
in which optimization procedures are regarded as dynamical systems. Integral quadratic constraints (IQC), 
which are usually used to obtain stability guarantees on complicated dynamical systems, are adapted in 
order to obtain sufficient conditions for the convergence of optimization algorithms. This methodology is 
able to establish iteration-independent linear rates of convergence by solving a single small semidefinite 
program. However those bounds, valid for any number of iterations, are in general not tight, i.e., more 
conservative than ours and those of m when used to estimate worst-case performance after a given finite 
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number of iterations (see Subsection 4.1.2 for an example). In addition, while this methodology is well-suited 
for studying the linear convergence rates of algorithms for smooth strongly convex optimization, it fails to 
recover the exact sublinear rates in the non-strongly convex case. 


2 Smooth strongly convex interpolation 


This section develops a necessary and sufficient condition for the existence of a smooth strongly convex 
function interpolating through a given set of data triples {xi,gt, i.e., deciding whether there exists a 

smooth strongly convex function / such that f(xt) = fi and gt G df{xi) for all i G I. 

This result generalizes the well-known set of conditions guaranteeing the existence of a convex, possibly 
nonsmooth interpolating function (see Theorem |T] in Subsection 2.31. It is the main technical ingredient of 
our exact convex reformulation of performance estimation problems. 


2.1 Definitions and problem statement 

We start by defining the functional class of interest, using the standard point of view from convex analysis — 
we refer to classic books mrnmm for details. Given two parameters p and L satisfying 0 < p, < L < +oo, 
we consider proper closed convex functions (i.e., whose epigraphs are non-empty closed convex sets) satisfying 
both a smoothness condition (depending on the parameter L, which is the Lispchitz constant of the gradient) 
and a strong convexity condition (depending on the parameter /i). We explicitly allow the case L = +oo to 
include nonsmooth functions, while g on the other hand is always assumed to be finite. In the rest of this 
paper, we use the conventions l/+oo = 0 and +00 — g = +00 to deal with the case L = +00. 

Definition 1 (L-smooth //-strongly convex functions) Consider a proper and closed convex function 
/ : R d —» lU {Too}, and constants g G R + , L G R + U {Too} with g < L. We say that function / is 
L-smooth p-strongly convex (which we denote by / G L^^R^)) if and only if the following two conditions 
are satisfied: 

(a) inequality -p\\gi — <72 11 2 < [|*i — £C2 1| 2 holds for all pairs X\,X2 G R d and corresponding subgradients 
ffi,S2 G R d (i.e., such that 31 G df(x 1) and g 2 G df(x 2)); 

(b) function f(x) — ^ ||ar||^ is convex. 


This definition is not entirely standard, as it involves subgradients and allows the constant L to be 
equal to Too. In the case of a finite L, condition (a) immediately implies uniqueness of the subgradient at 
each point, hence differentiability of the function, and we recover the well-known Lipschitz condition on the 
gradient of a smooth function. On the other hand, when L = 00, condition (a) becomes vacuous, and the 
function can be non-differentiable. Condition (b) can easily be seen to be algebraically equivalent to the 
standard definition of strong convexity (and the case g = 0 corresponds to a convex but not strongly convex 
function). The class of proper closed convex functions simply corresponds to Lo, 00 . The case L = g can be 
safely discarded, as it only involves simple quadratic functions whose minimization is trivial. The reason for 
this slightly non-standard definition of and the possibility of choosing L = Too will become clear later, 
when dealing with Legendre-Fenchel conjugation. 

As explained above and in the introduction, our approach to express the original infinite-dimensional 
(PEP) in a finite-dimensional fashion relies on an interpolating condition for smooth strongly convex func¬ 
tions. This directly motivates the following definition. 


Definition 2 (L M ,z,-interpolation) Let I be an index set, and consider the set of triples 
S = {(a 'i, gt, where Xi,gi G R d and fi G R for all » G /. Set S is IF^^L-interpolable if and only if there 

exists a function / G Lj^z^R^) such that we have both gi G df(xi) and fixf) = fi for all i G I. 







6 


A.B. Taylor, J.M. Hendrickx, F. Glineur 



Fig. 1 Example (xi, gi, fi) = (—1, —2.1) and (xo, g 2 ■ f 2 ) — (0, —1, 0) for / = {1, 2} and d = 1. This set satifies conditions 
( |Clf| ) but cannot be interpolated by a smooth convex function: the convexity requirement forces the interpolating convex 
function to lie entirely above its linear under-approximations, which lead to an unavoidable non-differentiability at xi. 


2.2 Necessity and sufficiency of conditions for smooth convex interpolation 


Our goal is to identify a set of necessary of sufficient conditions involving the set of data triples and charac¬ 
terizing the existence of an interpolating function. Finding necessary conditions is relatively easy: starting 
from any set of necessary conditions that holds on the whole domain of a smooth strongly convex function, 
one can simply restrict this set to those conditions involving only points Xi with i E I (i.e., to discretize it). 
For example, it is well-known that the class of L-smooth convex functions -To,1,(1^) is characterized by the 
pair of inequalities 


f{y) > f{z) + X7f(z) T (y - z), V y,z E R d , 
IIV/(y) - V/(z)|| 2 < L\\y ~ z\\ 2 , V y,z £ 


(Cl) 


Therefore, specializing those conditions for y = Xi and z = Xj with i,j E I leads to the following set of 
inequalities, which is necessary for the existence of an interpolating function in 


fi > fj + gj(xi ~Xj), Vi, j E I, 

1 9i ~ 9j\\2 < L\\xi - Xj\\ 2 , Vi, j E I. 


(Clf) 


Now, perhaps surprisingly, it turns out that this latter set of conditions is not sufficient to guarantee 
-Fo.L-inte] 

fe t 0 , l { 


J-q interpolability, despite the fact that the originating conditions (Cl) are sufficient to guarantee that 
l d ). In order to see that, consider the following example with / = {1,2} and d = 1: 


(xi,gi,fi) = (-1,-2,1) and (. x 2 , 92 ,f 2 ) = (0, —1,0). 

This pair can clearly not be interpolated by a function in ) for any L, as there is an unavoidable 

non-differentiability at xi as illustrated on Fig. [l] However, it satisfies Conditions (Clf I with L = 1, which 
is therefore not sufficient to guarantee smooth convex interpolation. 

Similarly, we can carry out the same exercise for the following conditions, also well-known to be equivalent 
to inclusion on J-o,L(lR d ) when imposed on the whole space: 

fi > fj + gj{xi -Xj), V i,j E I, (C2f) 

fi < fj +gj(xi - Xj) + ^||Xi - Xj\\l, V i,j E I. 

With an appropriate use of an additional dimension [d = 2), one can readily observe that some information 
may be hidden to this pair of inequalities. Consider the example 


(xi,gi,fi) = 


0) and (x 2 , g 2 , f 2 ) = 


.1 


from which no smooth convex interpolation can be made (again, unavoidable non-differentiability at both 
xi and x 2 ). However, both Conditions Clf and C2f are satisfied with L = 1 . 

Those examples illustrate the weakness of a naive approach that consists in discretizing standard nec¬ 
essary and sufficient conditions defined on the whole space. If those discretized conditions were used in a 
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performance estimation problem over a given class of functions they would implicitly allow the per¬ 

formance of functions that do not belong to the class to be taken into account. This would correspond 
to a relaxation of the original performance estimation problem, and would only lead to upper bounds on the 
worst-case performance. To conclude this section, note that any set of necessary and sufficient conditions for 
smooth convex interpolability must be the discretization of some necessary and sufficient conditions on the 
whole domain, whereas the previous examples precisely show that the converse is not true. 

In the next subsections, we follow a more principled approach in order to tackle the -^^-interpolation 
problem. We start with a special case of convex interpolation, that of proper convex functions without 
smoothness or strong convexity requirement (i.e., the class Po.oo), for which a solution is well-known. 


2.3 Convex interpolation 

In order to build interpolation conditions for the class of smooth strongly convex functions, we begin by 
constructing interpolation conditions for the simpler class of convex functions ■7 r o,oo(R <2 )- As this result will 
be one of building blocks of the smooth strongly convex interpolation procedure, a simple constructive proof 
of this theorem is provided. 

Theorem 1 (Convex interpolation) The set {(xi, gi, fi)} ieI is J r o t00 -interpolable if and only if 

fi > fj + g] ( Xi - Xj) Vi,j G I. (3) 


Proof (Necessity.) Assume there exists a convex function / : R d —* R such that fi = f(xi) and gi G 
df(xi) Vi G I. The definition of a subgradient then immediately implies that 

fi > fj + 9j (xi - Xj) Vi, j G I. 

(Sufficiency.) Define the following piecewise-linear convex function 

f(x) = max j/, +g](x- Xj)} . 

Since / is the pointwise maximum of a finite number of affine functions, its epigraph is a non-empty poly¬ 
hedron, and hence / is convex, closed and proper. In addition, f(xi) = fi holds by construction. Indeed, we 
first see that 


fi 


= fi + 9i{Xi - Xi ), 

< max j/j + g] (a- Xj)} = f(xi). 


Therefore, we have fi < f(xi). In addition to this, we have 


/( Xi) = max |/j -I- g T j (x t - Xj)} , 

< fi using Condition © for each j, 


which allows to conclude that f{xi) = fi. The construction also implies that gi G df(xi), i.e., 

/( x) = max j/j + g^ix — Xj ) j Vx G R d , 

> fi + g[{x — Xi) Vi G /, x G R d , 

> f(xi) + gj (x — Xi) Vi G /, x G R d . 


□ 
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One should note that the effective domain of / is dom / = R d — that is, the function takes finite values for 
all x G R d . This is of course not the only way of reconstructing a valid /. For example, we could choose 


/(z) 


ma xj {fj +g](x- Xj)} if x e conv ({a;i} ieJ ) , 
+oo otherwise. 


Remark 1 Our interpolation problem is an extension of the classical finite convex integration problem, which 
is concerned with the recovery of a convex function from a set of points Xi associated with a subgradient 
gt (i.e., function values are not specified). Finite convex integration is treated in details in [15] (only in the 
convex case g = 0 and L = +oo). It is the finite version of the continuous convex integrability problem, 
which is treated in [23]. 

A direct necessary and sufficient condition for deciding whether a set is convex integrable is to require 
the existence of function values fi for which the set {(xi,gi, fi)} ieI is convex interpolate. It is however also 
possible to derive a set of inequalities that does not involve unknown function values fi, using the so-called 
cyclic monotonicity conditions. More precisely, consider for every sequence {ii,i 2 , ■ ■ ■, ii v} of distinct indices 
in I, and the corresponding cyclic sequence of successive pairs {(ii, * 2 ), (* 2 , * 3 ), ■ ■ ■, (ijv-i, ijv), (ijv, *i)}- 
Summing inequality § over each pair of indices in this cyclic sequence produces a necessary inequality that 
does not involve function values fi. Moreover, the set of all such inequalities, obtained from all possible 
sequences of distinct indices, is necessary and sufficient for convex integration, see e.g. Q5U221- Note that 
this condition features a much larger number of inequalities. 


2.4 Conjugation and minimal curvature subtraction 

In this section, we review some concepts and results needed for our generalization of convex interpolation to 
smooth convex interpolation. We begin with the concept of conjugation operation. This operation is a key 
element in our approach, since it provides a way to reduce the general smooth strongly convex interpolation 
problem to a simpler convex interpolation problem. 

Definition 3 Given a proper function / : R d —> R U {+00}, the (Legendre-Fenchel) conjugate /* : R d —> 
R U {+00} of / is defined as: 

f*{y) = sup y T x- f(x). 

xev. d 

Conjugate functions enjoy numerous useful properties. Among other, they are always closed and convex. In 
fact, conjugation realizes a one-to-one correspondence in the set of proper closed convex functions (i.e., an 
involution), see for example Theorem 12.2 of |23| . That is, for any / 6 Jo 00 (R d ) we have f* G J-q oo (R d ) 
and /**=/. 

Among the useful properties of conjugate functions, we note that for any function / G Jo, 00 )) conjuga- 
tion can be seen as an operation reversing the roles of the coordinates and the subgradients: any subgradient 
(resp. coordinate) in one space becomes a coordinate (resp. subgradient) in the second space. In other words, 
for any function / 6 Jo ,oc(R d ) and its conjugate f*, it is equivalent to require that x and g satisfy condition 
g G df(x), condition x G df*(g) or condition f{x) + f*(g) = g T x. This can be obtained using first-order 
optimality conditions on the definition of conjugate function; we refer to Theorem 23.5 from [23] for more 
details. The next theorem emphasizes the effect of this link for the class of smooth convex functions. 

Theorem 2 Consider a function f G Jo,oc(R d ). We have f G Jo,i(® d ) */ and only if f* G -Ji/i,,ooO® d )- 

Theorem [ 2 ] is basically Proposition 12.60 of [23] in the case L < +00 and reduces to Theorem 12.2 of [23] in 
the case L = +00. This also explains why we need to include the case L = +00 in our interpolation problem: 
this is so that we can include the conjugates of smooth but non-strongly convex functions in Jo,L(R d ). 

The next lemma gives a simple way of expressing smooth strongly convex functions in terms of smooth 
functions. 

Theorem 3 Consider a function f G Jo,oo(R d ). We have f G J Mi .l(R d ) if and only if f(x) — ^^ 
Jb, L - M (R d ). 

This theorem holds true by Definition [l] when L = +00. The case L < +00 can be found in the proof of 
Theorem 2.1.11 in [2TI . 
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2.5 Necessary and sufficient conditions for smooth strongly convex interpolation 

We now focus on transforming the smooth strongly convex interpolation problem into a convex interpola¬ 
tion problem. In order to do so, we mainly use the two previously defined operations: conjugation (using 
Theorem [ 2 ]) and minimal curvature subtraction (using Theorem [3|. The reasoning is the following: 

(i) Reformulate the J-^,l interpolation problem into a interpolation problem using minimal curvature 

subtraction. 

(ii) Write the interpolation problem into a h ),00 interpolation problem using Legendre-Fenchel 

conjugation. 

(iii) Transform the J~i/^l—^),oo interpolation problem into a J~. 0,00 interpolation problem using again minimal 
curvature subtraction. 

The effect of minimal curvature subtraction on our interpolation problem, used in steps (i) and (iii), is 
described by the following Lemma. 

Lemma 1 Consider a finite set {(a 'i,gi, fi)} ieI with Xi,gi € R d and fi £ R. The following propositions are 
equivalent for any constants 0 < p < L < + 00 : 

(a) {(xi, gi, fi)} xeI is T^'L-interpolable, 

(b) {( x i: gi - p,Xi,fi - %\\xi\\l)} ieI is To^L-^-interpolable. 

Proof [(a) =4> (6)] It follows from Theorem [ 3 ] that if there exists / £ interpolating the set, then 

h{x) = /( x) — ^ \\xW 2 exists and satisfies h £ .Fo,.L-|i»(R d ) and Vi £ I: 

h{xi) = fi - ^llajillj, gi ~ g-Xi £ dh(xi). 

Hence, the function h £ •7 r o ) i,_ Al (R d ) interpolates the set {(xi,gi — pXi,fi — ^||ic*Ha )} ieI - 

[(a) <= (6)] If such a h £ J r o,i,-^(R d ) exists and satisfies the interpolation conditions (5), then one can 
reconstruct a function f(x) = h{x) + ^||a:|| 2 , / £ ^^(R^) which interpolates the set {(xi, gi, fi)} ieI - □ 

The effect of conjugation in step (ii) of the reduction procedure is precisely described in the following lemma. 

Lemma 2 Consider a finite set {( Xi,gi , fi)} ieI with Xi,gi £ R d and fi £ R. The following propositions are 
equivalent for any constant 0 < L < + 00 : 

(a) {(xi,gi, fi)} ieI is T 0 ,L-interpolable, 

(b) {(gi,Xi,xlgi - fi)} ieI is T x / L<oc -interpolable. 

Proof [(a) =>- (6)] It follows from Theorem [ 2 ] that if there exists / £ J-o,i.(R d ) then /* exists and satisfies 
/* £ J 7 i/L,oo(R d )- addition to that, if both / £ J-o,l(R d ) and /* exists, then they satisfy Vi £ / the three 
conditions (e.g., Theorem 23.5 of [23j): 

f(xi) + f*(gi) = gjxi, gi£df(xi), Xi£df*{gi). 

[(6) =>■ (a)] If a function /* £ .Fi/£ j00 (R d ) exists and satisfies the interpolation conditions (b), then the 
conjugate /** (which is convex, proper and closed by construction) satisfies /** £ J-o,L(R d ) by Theorem [ 2 ] 
as well as the interpolation conditions (e.g., Theorem 23.5 of [23]) Vi £ /: 

f**{xi) + f*(gi) = gjxi, gi£df**(xi), Xi£df*(gi). 

We obtain the desired result by choosing / = /**• □ 

We are now properly armed in order to define all interpolation equivalences. Let us now use steps (i), (ii) 
and (iii) to prove the main theorem of this section. 
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Theorem 4 (ft^.L-interpolability) Set {(x;, gi, fi)} ieI is -interpolate if and only if the following 
set of conditions holds for every pair of indices i £ I and j £ / 


fi fj ft (Xi Xj ) > 


2(1 - p,/L) \L 


In ,, 2 , .. 11 _ _,|| 2 0 ft / _ _ -,T 


T lift -9j\\ 2 +lA\xi - Xj\\ 2 - 2-{g 0 -gi) ( Xj - Xi) 


(4) 


Proof We begin by showing the equivalence between the following propositions: 

(a) {(xi,gi, fi)} ieI is Jy^L-interpolable, 

(b) {(xi, 5 i - [iXi.fi ~ 2 ll^-illa)}*e.r is -^o,L- M -interpolable, 

(c) {(ft - [iXi.Xi.xJgi - fi - %\\xi\\l)}. eI is J r i/( i _ M ) i00 -interpolable, 


(d) {(ft-ft^f^ 


9i 

L — ft 


5 L — fi 


fi 


Lg.\\xi\\ 2 

2 (L-,1) 



gi 

L — fi 


+ fi 


2 (i-M) 


IMI? 

2(£-ft 

MS 

2 (L — fi) 



is J-o,oo-interpolable, 
is J-o,oo-interpolable. 


Both (o) <t4 (ft) and (c) (d) are direct applications of Lemma [I] whereas both ( b ) (c) and ( d ) (e) are 
direct applications of Lemma[2] Theorem[4]follows from equivalence between propositions (a) and (e) applied 
to the necessary and sufficient conditions for convex interpolation of Theorem[I] Finally, it is straightforward 
to check that condition (e) reduces to the statement of the theorem. □ 


Remark 2 Note that one can also easily construct an interpolating function /(x) for the original set of 
points from Theorem |4ja) . It follows from Theorem [T] that a possible interpolating function for the set 
{ (xi, gi, fi'j | ^ of Theorem |ljc) is given by 


1 

- ft) 

This can be conjugated into an interpolating function ft*(x) of the set given by Theorem 0b). Using Theo¬ 
rem 16.5 from [23], this can equivalently be written in the form 


x — Xi\\ 2 > = max hi(x). 


h(x) = max \ fi + gj (x - xf) + 


2 (L 


h*(x) = conv (ft*(x)) , 


where the conv operation takes the convex hull of the epigraphs of the ft* ’s. Hence an interpolating function 
for the original set {(xi, gi, is given by 

/(x) = conv (ft*(x)) + |||x|| 2 - 

We provide an example of such an interpolating function on Fig. [2j 

It is straightforward to establish the equivalent interpolation conditions for both the smooth but non- 
strongly convex case (p = 0) and the nonsmooth strongly convex case (L = +oo). In the first case — given 
by Corollary [l] — we find the discrete version of the well-known inequality characterizing L-smooth convex 
functions, which turns out to be necessary and sufficient: 

/(x) > f(y) + V/ T (y)(x - y) + ^r||V/(x) - V/(t,)||*. 


Corollary 1 The finite set {{xi, gi, fi)} ieI is lFo,L-interpolable if and only if 


T, 1 2 

fi — fj ft \ x i — x i ) + 2Tj \ ^i — 


Vi,j £ I. 
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Fig. 2 Example of an interpolating function; the data triples to be interpolated by a 1-smooth convex function are 
(xi, gi, fi) = (2,2,3) and ( X 2 ,g 2 ,f 2 ) = (—3,—1,1). Figure shows the upper-bounding quadratic functions h*(x) (red, 
left), the interpolating function f(x) = conv ( h*(x )) (blue) and the gradients (black tangents). 


Nonsmooth strongly convex interpolation conditions are given in Corollary [2j which corresponds to the 
well-known inequality characterizing the subgradients of strongly convex functions. 

Corollary 2 The finite set {(x^, gi, is T-interpolate if and only if 

fi > fj + 9j i x i ~ Xj) + |||Xi - Xj\\l, Vi, j £ I. 


Remark 3 Following the spirit of Remark 1, we note that Theorem [4] can readily be extended to handle 
the finite (and continuous) integration problems for L-smooth fi- strongly convex functions (i.e., interpola¬ 
tion without function values). Indeed, summing inequality 0 from Theorem [4] over each pair in the cyclic 
sequence {(* 1 , 12 ), (* 2 , 13 ), ■ ■ ■, (ijv-i, ijv), (ijv, * 1 )} produces a necessary inequality that does not involve func¬ 
tion values fi. Moreover one can show that the set of those inequalities for all possible sequences is necessary 
and sufficient for finite convex integration of L-smooth /i-strongly convex functions, generalizing the standard 
cyclic monotonicity conditions. As an illustration, note that the following inequality 

T l/l 2 

(. 9i — 9j) { x i — x j ) ^ 2 + n/L (l ^ ^W Xi ~ x i II 2 J > 

that is standard in the analysis of gradient methods on smooth strongly convex functions (see e.g., Theorem 
2.1.12 of corresponds to cycles of length 2 (i.e., cyclic sequences (j, *)}). The set of all such 

inequalities is necessary but not sufficient, as it omits longer cycles. 


3 A convex formulation for performance estimation 

As explained in the introduction, our performance estimation problem can now be expressed in terms of the 
iterates and optimal point {xi,gt, fi} ie { 0 N only, using the interpolation conditions given by Theorem|4j 

As our class of functions (M rf ) and the first-order methods we study are invariant with respect to 
both an additive shift in the function values and a translation in their domain, we can assume without loss of 
generality that x* = 0 and /* = 0, which will simplify our derivations. We can also assume g* = 0, from the 
optimality conditions of unconstrained optimization. The problem can now be stated in its finite-dimensional 
formulation: 
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w^(R,M,N,V) = 




sup 


P {{Xi,gi,fi}i 


eu 


(d-PEP) 


such that {xi, gt, fi} ieI is -7>,z,-interpolable, 

Xi ,..., Xjv is generated from xo by method M with ©>. 
{a;*, g*,f*} = {0 d ,0 d ,0} and ||aao - x*\\ 2 < R. 


Problem (d-PEP I is an instance of ( f-PEP[ ) where the function class T is chosen to be the 

set of d-dimensional L-smooth /r-strongly convex functions, hence we have w(.F/i,z,(IR d ), R, Ai, N, V) = 
w^ L (R,M,N,V). Interestingly, in most situations of interest, quantity w^\{R,M,N,V) is monotoni- 
cally increasing with d, as a higher dimensional function can usually mimic a lower dimensional one (see 
Theorem [H] and subsequent results for a discussion on finite convergence of this sequence). 

Finally, note that problem (d-PEP) is not convex, as it involves several non-convex quadratic constraints 
(e.g., gjxi terms in the interpolation conditions). In the next section, we show how (d-PEP) can be cast as 
a convex semidefinite program [26] when dealing with a certain class of first-order black-box methods, those 
based on fixed steps. 


3.1 Fixed-step first-order methods 

We hereby restrict ourselves to the class of fixed-step first-order methods, where each iterate is obtained by 
adding a series of gradients steps with fixed step sizes to the starting point xo- 

Definition 4 A method A4 is called a fixed-step method if its iterates are computed according to 


2 — 1 

Xi = x 0 - ^ hi, k g k . 
k =0 


with fixed scalar coefficients /iqfe. 

A fixed-step method performing N steps is completely defined by the lower triangular N x N matrix 
H = {hi, k } 1<i<N o<fc<jv —1 (where hi ik is defined to be zero if k > i). Note that many classical methods 
such as the gradient method with constant step size (GM) and the fast gradient method (FGM) are included 
in this class of algorithms (see the details in Section [4]). 


3.2 A convex reformulation using a Gram matrix 


In order to obtain a convex formulation for (d-PEP), we introduce a Gram matrix to describe the iterates 
and their gradients. Denoting 

P = [go g\ ■■■ gN x Q ] 

we define the symmetric (N + 2) x (N + 2) Gram matrix G = P T P G S N+2 , which is equivalent to 


G — {Gi,j}o<i,j<N with 1 


Gij = gjgj for any 0 < i, j < N, 
Gn+ij = 9j for any 0 < j < N, 
Gi,N +1 = gj xo for any 0 < i < N, 

, Gn+i,n+i = Xq Xo 


(note that the size of this matrix does not depend on the dimension of iterate xo and gradients gt). 

The constraints in problem (d-PEP) can now be entirely formulated in terms of the entries of the Gram 


matrix G along with the function values fi. Indeed all iterates apart from xq can be substituted out of the 
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formulation using Definition [4] of a fixed-step method, and the resulting formulation only involves function 
values fi and inner products between xo and all gradients gi. 

Note that the initial iterate xo and successive gradients gi of any solution to problem (d-PEP I can be 
transformed into a symmetric and positive semidefinite Gram matrix G. Moreover, since vectors xo and gi 
belong to R d , matrix G has rank at most d. In the other direction, it is easy to see that any symmetric and 
positive semidefinite Gram matrix G of rank at most d can be converted back (using Cholesky decomposition 
for example) into N +2 vectors xo G R d and gi G R d which describe the initial iterate and successive gradients 
of a d-dimensional function (this transformation is however not unique). From those observations we can 
anticipate that an equivalent formulation of (d-PEP I will rely on imposing that G is symmetric and positive 
semidefinite, which is a convex constraint and will naturally lead to a semidefinite program. 


3.3 Exact worst-case performance of fixed-step first-order methods as a semidefinite program 

For notational convenience, we define vectors hi G R w+2 for any i between 0 and N and /i* G R iv+2 as 
follows 

h[ = [-hi,o -hi ,i ... -hi,i -1 0 ... 0 1], hi = [0 ... 0 ], 

so that we have Xi = Phi. In order to lighten the notations we also define Ui = ei+i G R^" 1 " 2 , the canonical 
basis vectors, and u* the vector of zeros. Using those notations, we rewrite the interpolation constraints 0 
coming from Theorem [4] in the following form: 

fi ^ fj T — Ghi 'LLj Ghj) T J ('LLi "Uj ) G(u% U j ) 

+ jf—^(ujGhj - ujGhi) + ^ ( hi - hj) r G(hi - hj), for all i,j G I. 

We can equivalently formulate all constraints using the trace operator, and add the distance constraint 
||a:o — •i'* II 2 — ^ on the starting point as well as the positive semidefiniteness constraint for G. Defining 
matrices A. (J and Ar in the following way 

2Aij — y (uj(hi hj ) -{- (/lj hj)Uj 'j “j“ y (u^ 'U’j ) (jlLi Uj ) 

j-J — /i \ / ±J — fJj 

+ -—— (ui(hj - hi) T + ( hj - hi)uj] + f fL {hi - hj)(h z - hj) T , for all i,j G I , 

±j — M'' /J-j — [JL 

Ar = un+iun+i- 

we obtain the following compact formulation for the feasible region that is linear in its variables / G R^ 1 
and G G § Ar+2 


fj ~ fi + Tr ( GAij ) < 0, for all i,j G /, 

Tr (GAr) - R 2 < 0, 

Gy o. 

From the discussion at the end of the previous section, it is easy to see that any d-dimensional function / 
and starting point Xq G R d produce a feasible solution (/, G ) where matrix G has rank at most d. On the 
other hand, any feasible solution (/, G) where G has rank at most d can be interpolated into a d-dimensional 
function / G Jy J ../- J (R fi ) and a starting point xo G R d . Indeed, matrix G = P T P provides xo G R d and 
N + 1 successive gradients ji G R d , while the other iterates Xi derive from the definition of the method. Our 
interpolating conditions ensure that a function compatible with these data triples {xt,gi, fi} ieI exists. 

Considering finally the performance criterion V, we observe that any concave semidefinite-representable 
function in G and / leads to a worst-case estimation problem that can be cast as a convex semidefinite 
optimization problem (see e.g., 0 ), and that such a criteria does not depend on the dimension d of the 
functions. In particular, linear functions of the entries of / and G are suitable. Classical performance criteria 
such as /(*jv) — /*, || V /{x iv) || 2 an d ||a;jv — x* || 2 are indeed covered by this formulation. We focus below 
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on the case of a linear performance criterion, but note that other criteria can be useful (see for example a 
concave piecewise linear criteria used in Section 4.3). 

We can now state the main result of this paper. 


Theorem 5 Consider the class of L-smooth fi-strongly convex functions with 0 < p, < L < oo, a 

fixed-step first-order method that computes N iterates according to matrix H G R NxN , and a performance 
criterion Vb,c(f,G) = b T f + Tr (CG) that depends linearly on the function values at those iterates and 
quadratically on the iterates and their gradients (b G R ^ -1-1 and C G 

If N < d — 2, the worst-case performance after N iterations of method H applied to some function in 
R d ) is equal to the optimal value of the following semidefinite program 

w sd l (R, H, N, b, C) = sup b T f + Tr (CG) (sdp-PEP) 

' GeS w + 2 ,/GR w + 1 

such that fj — fi + Tr (■ GAij ) < 0, i,j G I, 

Tr (GA r ) - R 2 < 0, 

Gh 0, 


with matrices Aij and Ar as defined above. In others words, 


w 


sdp / 
p,L\ 


{R, H, N, b, C) = ui^ L (R, M , N, 7\c) for any d> N + 2. 


Proof We have already shown the two-way correspondence between functions in J r /ij / J (R rf ) and feasible 
solutions of this problem where matrix G has rank at most d. Since matrix G has size N + 2, it has rank 
at most IV + 2, which establishes that this semidefinite program is a correct formulation of the performance 
estimation problem when d > N + 2. □ 


The optimal value w s £ (R, H, N, b , C ) of (sdp-PEP I is not necessarily finite or attained at some feasible 


point. However, when L is finite, any continuous performance criterion V will force the optimal value to be 
attained and finite, because the constraints on variables / and G force the domain to be compact. 

Proposition 1 Under the assumptions of Theorem [d[ the optimum of (sdp-PEP) is attained and finite 
when L < oo. 


Proof To show that the solution of (sdp-PEP I is attained and finite, it suffices to prove that its feasible 


region is compact (since the objective is continuous). We first prove that the iterates of method H applied 
to any function in P^.l (R d ) with L < oo are bounded, as well as their gradients. 


Note that the Lipschitz condition on the gradients (Clf) with j = * shows that if iterate Xi is bounded 


gradient gi is also bounded. We proceed by recurrence. We start with the fact that xo is bounded, using the 
assumption that £* = 0 and constraint ||a;o — x*|| 2 < R. This implies that go is bounded, hence that xi is 
bounded using Definition [4] of a fixed-step method. This implies in turn that gi is bounded, then that X 2 is 
bounded, and so on until we have shown that all iterates and gradients are bounded. 

Condition (C2f) with j = * then implies that function values fi are bounded. Therefore all entries in 


variables / and G are bounded which, combined with closedness of the feasible region, establishes the claim. 

□ 


Remark f When L = +oo (recall the conventions l/+oo = 0 and +oo — p = +oo used in this paper), 
the feasible region may be unbounded and it is possible to design feasible functions which drive standard 
performance criteria arbitrarily away from 0. Nevertheless, performance estimation on such nonsmooth 
functions could still be tackled after introduction of another appropriate Lipschitz condition on the class 
of functions, such as ||<?i ||2 < L. We leave this as a topic for further research and, in the rest of this text, 
restrict ourselves to the smooth case L < +oo. 


Our formulation (sdp-PEP) is dimension-independent, and computes the exact worst-case performance of 
a first-order method with N steps as long as the class of functions of interest contains functions of dimension 
at least N + 2. This corresponds to the so-called large-scale optimization setting, which is usually assumed 
when analyzing the worst-case of first-order methods. Function classes with smaller dimensions can also be 
handled with the addition of a (non-convex) rank constraint on G. We obtain the following result, whose 
proof is straightforward. 
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Proposition 2 Consider the setting of Theorem [5| If d < N + 2, the worst-case performance after N 
iterations of method H applied to some function in is equal to the optimal value of the semidefinite 

program \sdp-PEl ^ under the additional constraint rankG < d. 

Note that this also establishes that the sequence {w^ L (R, A4, N, Vb,c)}d= 1 , 2 ,... is monotonically increasing, 
and that it converges for a finite value of d. 

Corollary 3 The worst-case performance after N steps of a fixed-step method on a L-smooth (p-strongly) 
convex function is achieved by an N + 2-dimensional function. 


Finally, note that, when applied to the gradient method in the non-strongly convex case {p = 0), problem 
(sdp-PEP I is equivalent to one of the formulations proposed in m, more specifically to their problem (G). 
Theorem |5] establishes that this relaxation is in fact exact. 


3.4 A dual semidefinite program to generate upper bounds 


In general, it is not easy to find an analytical optimal solution to (sdp-PEP). Hence, we are also interested 


in a generic and easier way of obtaining analytical upper bounds on the performance of a given algorithm. 
A classical way of doing so is to work with the Lagrangian dual of (sdp-PEP |: 

inf tR 2 such that tAr — C + KjAij >z 0, (d-sdp-PEP) 

3 ijei 

b ^ ^ Xij (.Uj — 0, 


Xij > 0 , 

T > 0 , 


i,j e 1, 


whose feasible solutions will provide theoretical upper bounds on the worst-case behavior of every fixed-step 
first-order method (using weak duality). Note that the final dual formulation used in [TO], which deals with 
the case p = 0 , can be recovered by taking Xij = 0 for i + 1 / j or i 7 ^ = 1 = in our dual, i.e., it is a restriction 


of (d-sdp-PEP I with a potentially larger optimal value. 


The next theorem guarantees that no duality gap occurs between (sdp-PEP I and (d-sdp-PEP ) under the 
technical assumption that hi^-\ / 0 (i € {1,..., N}). This assumption is reasonable as it only implies that, 
at each iteration, the most recent gradient obtained from the oracle has to be used in the computation of 
the next iterate. The theorem will also guarantee the existence of a dual feasible point attaining the optimal 


value of the primal-dual pair of estimation problems (sdp-PEP I and (d-sdp-PEP I, i.e., a tight upper bound 
on the worst-case performance of the considered method. 

Theorem 6 The optimal value of the dual problem 
to ur f^(R, H, N, b, c) under the assumptions that hi y 


d-sdp-PEP) with 0 < p < L < 00 is attained and equal 
i/O for all i e iV}. 


Proof We use the classical Slater condition [ 8 ] on the primal problem in order to guarantee a zero duality 
gap — that is, we show that (sdp-PEP I has a feasible point with G >- 0. The reasoning is divided in two 


parts; we consider first the case p = 0 and L = 2 + 2cos( 7 r /(N + 2)), and we generalize it to general p < L 
afterwards. Consider the quadratic function fix) = ^x T Qx with the following tridiagonal positive definite 
matrix Q 

/2 1 0 0 


Q = 


1210 


0121 


V 


o\ 

0 

) 


y- 0 . 


We show how to construct a full-rank G feasible for (sdp-PEP) using the values of the quadratic function 
/. In order to do so, we exhibit a full-rank matrix 


P = [xo go gi ... ffjv] 
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corresponding to the application of a given method (with hij—i 7 ^ 0) to the quadratic function /. Indeed, 
choosing xo = -Rei, we can show that P is upper triangular with non-zero diagonal entries. Then we have 

go = Qx 0 = 2ei + e 2 , 
xi = xo — hi^ogo, 

gi = Qx 1 = go — h\ t oQgo = 2 ei + e 2 — hi,o( 4 ei + 4 e 2 + 63 ). 


Hence g\ has a non-zero element associated with e 3 whereas the only non-zero elements of go are associated 
with ei and e 2 - Now, we assume that gi-i has a non-zero element corresponding to e^+i and zero elements 
corresponding to ek for all k > i + 1 , while all previous gradients have zero components corresponding to e*, 
for all k > i. Then we have 


9i di-(-2 X j Qdi-\-2 Xi (e^-f-l T 2 T —(—3 ) ? 

with xJei +2 = xJei +3 = 0 and xjei+i 7 ^ 0 because of the recurrence assumption and the iterative form of 
the algorithm: 


i -2 


r i Ci -\-1 — Xo Ci+l ^ ' hi,k Qk dz+1 9i—l&i-\-l , 

k= ° s jo 

i — 2 

vJei +2 = x^ei- 1-2 — hi,1 c gk£i +2 ~hi,i -1 gJ-iei+2, 

i -2 

^i+3 = &i -\-3 ^ ^ Pfc ^i+3 hi,i—\ Qi — \&i-\-Z • 


fc =0 


Hence, gi has a non-zero element associated with ei+ 2 . We deduce that the following components are equal 
to zero by computing gJei+ 2 +k for k > 0 : 


gi &i-\-2-\-k %i ^^i+2+fc %i {&i-\-l-\-k 2c^_|_2-{-fc "h &i-\-3-\-k'): 

which is zero because of the algorithmic structure of Xi , i.e., 


Vi &i-\-l-\-k — ^O^i+l+A; ^ ^ hi,k gk ^i+l+fc hi,i— l l1 + A: • 


fc =0 


=0 


Hence, P is an upper triangular matrix with positive entries on the diagonal, and is therefore full-rank. In 
order to make this statement hold for general g < L, observe that the structure of the matrix is preserved 
using the operation (In +2 is the identity matrix) 

Q' = (Q - Amin(<3)-IiV+2) T--TVyT + /-ifjV+2■ 

The corresponding quadratic function is easily seen to be L-smooth and /r-strongly convex. Therefore, the 
interior of the domain of (sdp-PEP I is non-empty and Slater’s condition applies for /r < L, ensuring that no 
duality gap occurs and that the dual optimal value is attained. □ 


One can note that Theorem [ 6 ] guarantees the existence of a fully explicit proof (i.e., a combination of valid 
inequalities, or equivalently, a dual feasible solution) for any worst-case function (see the example at the end 
of this section). 
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3.5 Homogeneity of the optimal values with respect to L and R 

We observe that, for most performance criteria, one can predict how the worst-case performance depends 
from parameters L and R, provided the fixed step sizes contained in H are scaled appropriately (i.e., 
inversely proportional to L). In the rest of this paper we will only consider such scaled (normalized) step 
sizes. Therefore, the corresponding performance estimation problems have only to be solved numerically in 
the case R = 1 and L = 1, from which a general bound valid for any L and R can be deduced. 

More specifically, a classical reasoning involving appropriate scaling operations easily leads to the follow¬ 
ing homogeneity relations for the standard criteria /(*jv) — /*, ||V/(a;jv)|| 2 and ||xjv — x*|| 2 : 

w sdp L (R,H/L,N,f(x N ) - /*) = LR 2 (1, H, N, f(x N ) - /.), 

<?£(*, H/L, N, || V/(x jv )||^)=L 2 R 2 <f(l,H,JV,||V/(x JV )||^), 

<?£(*, H/L, N, \\x N - x*\\l) = R 2 (1, H, N, ||xjv - x* || 2 ), 

where k = fj,/L is the inverse condition number and H/L describes the fixed-step method obtained by 
dividing all step sizes hij by the Lipschitz constant L. Results in the rest of this paper implicitly rely on 
these relations. 


3.6 A simple example 


Consider the very simple case of a method performing a single gradient step using the non-standard step- 
size i.e., xi = xo — ^V/(xo) (this is actually conjectured to be the best possible step size for a single 
step, see Section 4.1.11. One wishes to estimate the worst-case objective function accuracy after taking that 


step, i.e., maximize f(x i) — /*, over all L-smooth convex functions. Solving the corresponding semidefinite 
formulation (sdp-PEP) with n = 0, N ss: 1, H m (|) and Vb,c(f j G) = fi provides the optimal value 


w, 


sdp 

0 ,L 


R,{ 


) ’ 1, CD 


,o" 


LR 


attained by the following optimal solution 


fo = 


LR 2 


,/i = 


LR 2 


and G = LR 


< L —L/2 1 \ 

—L/2 L /4 -1/2 ^ 0. 

V 1 -1/2 1 /Lj 


This means that f{x i) — /* < holds for any / G To,l 


d ) for any d and provided that ||xo — x*|| < R. 
achieves this worst-case when started from xq = R. 


It is easy to check that function f{x) = =/x 2 G Hq,l{ 

Indeed one can successively evaluate /o = f(x o) = go = V/(xo) = LR, x\ = R — |R = — //. 

/i = f(x i) = LR- and g\ = — LR, This function is one-dimensional since the optimal G has rank one (note 


that Corollary [3] only guaranteed the existence of a three-dimensional worst-case). 

Solving the dual problem ( |d-sdp-PEP| ) leads to the same optimal value , attained by optimal multi- 


pliers A 01 = A«o = A*i = 1 > 0 and r — 4. The corresponding dual slack matrix is 


/ l/L l/L -l/2\ /-1/A 

S = f l/L l/L -1/2 = - -l/L (-l/L -l/L 1/2) ^ 0. 

2 V-1/2-1/2 L/A 2 V 1/2 / 


From this dual solution, a fully explicit proof of the worst-case performance can be derived, which can be 
checked independently without any knowledge about our approach. Indeed, linear equalities in the dual 
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imply that the objective accuracy /(x 1 ) — /* can be written exactly as follows 


f(x i) - /(x*) = - ( /(xi) - /(xo) + V/(xi) T (x 0 - xi) + 2 ^||V/(x 0 ) - V/(xi)|| 


1 


1 


+ x f(x o) - /(x*) + V/(x 0 ) (x* - x 0 ) + xrl|V/(x q) - V/(x 


1 


2L 

1 


+ o /fai) “ /(**) + v /( a; i) (®* - *0 + 77ri|V/(xi) - V/(x 


+ ^l|xo — X*II 2 - ^ 


2L 

i(xo 

Z Ju Ju 


(where for the last term we write the quadratic form Tr ( SG ) as a square, since S is rank-one). This equality, 
which is straightforward to check using xi = xo— y|;V/(xo) and V/(x*) = 0, immediately implies inequality 
/(x 1 ) — /» < -(f11Xo — x*|| 2 , since the first three bracketed expressions are nonpositive because of inequalites 
from Corollary [l] valid for all functions in To ,l- 


4 Numerical performance estimation of standard first-order algorithms 

In this section we apply the convex PEP formulation to study convergence of the fixed-step gradient method 
(GM), the standard fast gradient method (FGM) and the optimized gradient method (OGM) proposed 
by m- We begin with the GM for smooth convex optimization, whose worst-case is conjectured in [10] 
to be attained on a simple one-dimensional function. Numerical experiments with our exact formulation 
confirm this conjecture. Further experiments on the worst-case complexity for different methods, problem 
classes and performance criteria lead to a series of conjectures based on worst-case functions possessing a 
similar shape. We conclude this section with the study of a nonlinear performance criteria corresponding to 
the smallest gradient norm among all iterates computed by the method. 

All numerical results in this section were obtained on an Intel 3.5Ghz desktop computer using a combina¬ 
tion of the YALMIP modeling environment in MATLAB [T7], the MOSEK [18] and SeDuMi [25] semidehnite 
solvers and the VSDP (verified semidehnite programming) toolbox |11| . 


4.1 Gradient method 

4.1.1 A conjecture on smooth convex functions by Drori and Teboulle m 

Consider the classical fixed-step gradient method (GM) wit h con stant step sizes applied to a smooth convex 
function in J-o,z,(R d ). Following the discussion in section 3.5 we use normalized step sizes J , inversely 
proportional to L. 

Gradient Method (GM) 

Input: / G .Fo,L(]R d ), x 0 G R d , yo = x 0 . 

For i = 0 : N — 1 

X j-)-l — Xi “V f{xf) 


The following conjecture on the convergence of the worst-case objective function values was made in m, 

Conjecture 1 m, Conjecture 3.1.) Any sequence of iterates {xi} generated by the gradient method GM 
with constant normalized step size 0 < h < 2 on a smooth convex function / G J-o,i(R d ) satisfies 

f(x N ) -f*< LH 2 max ( 2]V/ J TI , (1 - hf 
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A proof of the conjecture is provided in m for step sizes 0 < h < 1, leaving the case 1 < h < 2 open. We 
also recall that the upper bound in this conjecture cannot be improved, as it matches the performance of 
the GM on two specific one-dimensional functions. Indeed, define 


h{x) 

h(x) 


LR | 

2Nh+l I 


L 2 
2 X ' 


LR 1 2 

2(2Nh+l) 2 


if |o:| > 


R 

2JV/1+1 ’ 


else, 


It is straightforward to check that the final objective value accuracy of GM on /i is equal to 2jv } +1 > 

and that it is equal to ^-(1 — h) 2N on fi. This means that the conjecture can be reformulated as saying 
that the worst-case behavior of the GM according to objective function accuracy is achieved by function f\ 
or / 2 , depending on which of the two is worst (which will depend only on the normalized step size h and 
number of iterations N). 

Intuitively, the behavior of GM on piecewise affine-quadratic /i corresponds to a situation in which 
iterates slowly approach the optimal value without oscillating around it (i.e., no overshooting), whereas GM 
applied on purely quadratic /2 generates a sequence oscillating around the optimal point. Those behaviors 
are illustrated on Fig. [3] We also note that iterates for /i stay on the affine piece of the function, and even 
never come close to the quadratic piece. Interestingly, the existence of a one-dimensional worst-case function 
with a simple affine-quadratic shape will also be observed for the other algorithms studied in this section, 
both in the smooth convex and in the smooth strongly convex cases. 



Fig. 3 Behavior of the gradient method on fi (left) and fi (right), for L = R = 1. We observe that GM does not overshoot 
the optimal solution on fi , while it does so at each iteration on / 2 . 


Empirical results from the numerical resolution of ( |sdp-PEPj ) strongly support Conjecture [l] Indeed, 
when comparing its predictions with numerically computed worst-case bounds, we obtained a maximal 
relative error of magnitude 10 -7 (all pairs of values N G {1,2, ...,30} and h G {0.05, 0.10,..., 1.95} were 
tested). It is also worth pointing out that the Gram matrices computed numerically correspond to the 
one-dimensional worst-case functions /i and fi introduced above. 

Before going into the details of other methods, we underline another observation coming from m- 
Conjecture [l] also suggests the existence of an optimal step size h op t(N) for the GM — optimal in the sense 
of achieving the lowest worst-case. That is, if you know in advance how many iterations of the GM you will 
perform, it suggests using a step size h op t(N ) that is the unique minimizer of the right-hand side of the 
Conjecture [l] for a fixed value of N. It is obtained by solving the following non-linear equation in /i op t (for 
which no closed-form solution seems to be available): 

1 — n u \ 2N 

2Nh op t + 1 _ 1 kopt) ■ 

1 This equation possesses several solutions, but the optimum is the unique point where the two terms feature derivatives 
of opposite signs (a necessary and sufficient condition for the maximum of two convex functions of one variable). This point 

can easily be computed numerically with an appropriate bisection method. 
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This optimal step size can be interpreted in terms of the trade-off between what we obtain on functions /i 
and fi. On the one hand, we ensure that we are not going too slowly to the optimal point on /i, and on the 
other hand we do not want to overshoot too much on / 2 . 

Assuming Conjecture [T] holds true, one can show that the optimal step size is an increasing function of N 
with 3/2 < h 0 pt(N) < 2 and h op t(N) —» 2 as N —> oo. More precisely, working out the expression defining 
h-opt gives the following tight lower and upper estimates: 

2 - l0 f^ V ~ 1 + (1 + 4A0“ 1/(2JV) < hopt(iV) < 1 + (1 + 2iV)- 1/(2JV) ~ 2 - 1 °|^ V . (5) 

It is interesting to compare the results from the relaxation (G’) proposed for GM in [TO] with ours, for 
values of the normalized step size h that are close to h op t- Indeed, while the results of the two formulations 
are quite similar for most values of h, it turns out that those from m are significantly more conservative 
in the zone around h op t, as presented in Table [l] for different values of N. This also formally establishes the 
fact that the formulation from m is a strict relaxation of the performance estimation problem. 


N 

hopt 

Conjecture [l] 

Value computed in [10] 

Rel. error 

Value from l|sdp-PEP| 

Rel. error 

1 

1.5000 

LR 2 /8.00 

LR 2 / 8.00 

0.00 

LR 2 /8.00 

7e-09 

2 

1.6058 

LR 2 / 14.85 

LR 2 /14.54 

2e-02 

LR 2 / 14.85 

5e-09 

5 

1.7471 

LR 2 / 36.94 

LR 2 / 32.57 

le-01 

LR 2 / 36.94 

le-08 

10 

1.8341 

LR 2 / 75.36 

LR 2 / 59.80 

3e-01 

LR 2 / 75.36 

3e-08 

20 

1.8971 

LR 2 / 153.77 

LR 2 / 109.58 

4e-01 

LR 2 / 153.77 

6e-08 

30 

1.9238 

LR 2 /232.85 

LR 2 / 156.23 

5e-01 

LR 2 /232.85 

7e-08 

40 

1.9388 

LR 2 / 312.21 

LR 2 /201.10 

6e-01 

LR 2 18X1.21 

3e-08 

50 

1.9486 

LR 2 / 391.72 

L.R 2 /244.70 

6e-01 

LR 2 /391.72 

le-07 

100 

1.9705 

LR 2 / 790.22 

LR 2 / 451.72 

7e-01 

LR 2 / 790.22 

le-07 


Table 1 Gr adient Me thod with fj, = 0, worst-case computed with relaxation from m and worst-case obtained by exact 
formulation (|sdp-PEP|| for the criterion /(.r,y) — /*. Error is measured relatively to the conjectured result. Results obtained 
with MOSEK~fl8r 


These numerical results have been obtained with MOSEK, a standard semidefinite optimization solver. 
Despite convexity of the formulation, it might happen that the solution returned by such as solver is inaccu¬ 
rate, and in particular (slightly) infeasible. In that case, the objective value of the approximate primal (resp. 
dual) solution is no longer guaranteed to be a lower (resp. upper) bound on the exact optimal value, hence 
potentially negating the advantage of an exact convex formulation. For this reason, all numerical results re¬ 
ported in this section have been double checked with an interval arithmetic-based semidefinite optimization 
solver HD that returns an interval that is guaranteed to contain the optimal value. These guaranteed bounds 
are reported in Table [2] for the case h = 1.5, which compares them with Conjecture [l] 


N 

i 

2 

5 

10 

15 

20 

30 

Relative error (upper limit) 

2e-09 

7e-10 

2e-09 

le-09 

9e-10 

le-09 

9e-10 

Conjecture 

1.2e-01 

7.1e-02 

3.1e-02 

1.6e-02 

l.le-02 

8.2e-03 

5.5e-03 

Relative error (lower limit) 

2e-09 

3e-09 

9e-09 

9e-08 

2e-07 

3e-07 

9e-07 


Table 2 Gradient method with relative step size h = 1.5: numerical values from Conjecture ijand relative error for the 
upper and lower limits of the guaranteed interval obtained numerically with VSDP eh and SeDuMi SB). 


We can observe that the use of a verified solver does not impact our conclusions about the validity of 
the conjecture. Moreover, this table is typical of what we observed for all conjectures in this section: all 
numerical results reported were validated 2 ] and in what follows we will no longer mention this verification 
explicitly. 

2 Except for tests where validation encountered numerical difficulties, i.e for which VSDP returned no valid interval, which 
occurred more and more frequently as the value of the worst-case bound became closer to zero. 
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Finally, we compare results obtained with Conjecture[l]with classical analytical bounds from the literature 
for the GM with unit normalized step size h = 1 (which is usually recommended, and sometimes called 
optimal). The best analytical bound we could find, e.g. in [?J, states that 


f(x N ) - /* < 


LR 2 1 
2 N + 1' 


( 6 ) 


This analytical bound is asymptotically worse by a factor of 2 than the bound predicted by Conjecture [l] 
with h = 1. Similarly, one can investigate the effect of choosing the optimal normalized step size h D P t (N) 
instead of h = 1: Conjecture[l]then predicts another improvement by a factor of 2. These observations follow 
from the asymptotic (large N ) behaviors of the different worst-case bounds on /( xn ) — /*, which can easily 
be computed: 


Conjecture [l] with h = 1—>• 


LR 2 


1 


2 2IV T 1 


Conjecture [l] with h = h op t(N) 


LR 2 


1 


N—>oo 2 AN + 1 


4-1.2 A generalized conjecture for strongly convex functions 


In view of the encouraging results obtained for the GM in the smooth case, we now study the behavior 
of the GM on the class of strongly convex functions R d ) using our formulation (sdp-PEPl with the 

same performance criterion, objective function accuracy. It turns out that the solution for every problem 
consisted again in a one-dimensional worst-case function (rank G = 1) of the same piecewise quadratic type. 
We therefore introduce the following general definitions for functions /i, T and fi~- 


fi,r(x) 


%x 2 + a T \x\ + b T if |o:| > t 
ffx 2 else, 


h(x) = ^ x 2 , 

where scalars a T = (L — /r)r and b T = —( L ~f ^ )r 2 are chosen to ensure continuity of fi tT and its first 
derivative, and r is a parameter that controls the radius of the central quadratic piece (with the largest 
curvature). Although the value of parameter r could in principle be estimated from the numerical solutions 
of our problems, it turns out it can be computed analytically by maximizing the final objective value /i, t (*jv) 
(assuming that all iterates stay in the affine zone |x| > t), which then leads to 

Rk 

(k — 1) + (1 — kK)~ 2N 

where k = ^ is the inverse condition number of the problem class / G We are now able to extend 

Conjecture Q] to the GM applied to strongly convex functions. 

Conjecture 2 Any sequence of iterates { Xi } generated by the gradient method GM with constant normalized 
step sizes 0 < h < 2 on a smooth strongly convex function / G J>,z,(R d ) satisfies 

/(*») - /• < Iff - >»ax ( (K _ 1| + ( )_ K , (1 


(7) 


As in the previous section, this conjecture states that the worst-case behavior of the GM according to 
objective function accuracy is achieved by function /i jT or / 2 , depending on which of the two is worse. 
Proceeding now to its numerical validation, we first point out that our results are intrinsically limited to 
the accuracy that can be reached by numerical SDP solvers. For this reason, we only report on situations 
for which Conjecture [ 2 ] predicts a final accuracy larger than 10~ l> , ensuring a few significant digits for the 
numerical results. The resulting estimated relative differences between Conjecture [2] and the numerical results 
obtained with (sdp-PEPj) are given in Table [3j for different values of k. We observe that the conjecture is 
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K, = 0 

.001 

.005 

.010 

.015 

0.1 

0.2 

0.5 

Rel. error 6e-10 

7e-10 

4e-10 

6e-10 

8e-10 

2e-07 

9e-08 

le-06 


Table 3 Maximum relative estimated differences between Conjecture [2] and corresponding numerical results obtained with 
SeDuMi 1251 . The maximum is taken over all N £ {1,..., 30} and h £ (0.05,..., 1.95} for which the conjecture predicts a 
worst-case larger than 10 -6 . 


very well supported by our numerical results, with a largest relative error around 10 -6 , reached for the 
largest value of k considered here. This is expected as GM tends to perform better as k increases (i.e., final 
accuracy /(xjv) — /* approaches zero), which renders a precise comparison between numerical results and 
the conjecture more and more difficult. 

We now investigate some consequences of our conjecture. First, we note that Conjecture [2] tends to 
Conjecture 1 as p tends to zero. This is a consequence of the fact that r tends to 2 WK +1 as K tends to zero (one 
can also check that function f \, r tends to function /1 introduced earlier). Hence our formulation ( |sdp-PEP ) 
closes an apparent gap between worst-case analyses of the smooth convex and the smooth strongly convex 
cases. Indeed, to the best of our knowledge, existing worst-case bounds for the smooth strongly convex case 
do not converge to the smooth case as /x —» 0. 

It is also interesting to compare our results to those obtained with the IQC methodology of [16]. If we 
only care about asymptotic linear rates of convergence, Conjecture [ 2 ] predicts 

/(ajjv) — /* < max^K pf N , p% N } with pi = |1 — nh\ and p 2 = |1 — h\ 

(the first term in the max was obtained by neglecting (k — 1) in the denominator). On the other hand [ 161 
Section 4.4] proves that the distance to the solution converges linearly according to 

H^at — ar*|| < /? Ar ||a;o — :r*|| with a factor p = max{pi,p 2 ) 


with the same values for p\ and P 2 . This matches our asymptotic rate up to a multiplicative constant. 

As for Conjecture |TJ our new Conjecture [ 2 ] suggests optimal step sizes h op t{N, k), which can be obtained 
by solving the equation (for 0 < n < 1) 


K 

(k — 1) + (1 — re/i 0 pt) _2iv 


/1 7 \ 2 N 

(1 i^opt) 


( 8 ) 


(note that one recovers the previous equation for h op t when p tends to zero). For a given N, as n increases 
from 0 to 1, those optimal step sizes decrease from h op t(N, 0) (optimal step size in the smooth case) to 
h op t(N, 1) = 1 (the latter being expected since it can only correspond to the case of function f -2 in the 
original ( |PEP I, for which the GM with h = 1 converges in one iteration). For a given k, we find that 
ho P t(N,n) increases as N increases, as in the smooth convex case, according to the following lower and 
upper estimates 


1 + 


K - 1 1 

-b — 

K Hi 


1 — Hi 


<h opt (N, k) < min < 1 + 


Hi Hi 


1 + Hi 


(9) 


which both tend to as N increases (the first term appearing in the min of the upper bound tends to 
2 — k, which is always greater than y—). This limiting normalized step size corresponds to step size 
yyAy that is often recommended for the GM, and sometimes called optimal. 

We now illustrate the improvements provided by Conjecture [2] with respect to the classical analytical 
worst-case bound found in the literature. When using normalized step size h = iterates from GM 

applied to functions in are known to satisfy (see [21] for example) 

/(«)-/.<^f(I^)“. (10) 

On the other hand, as the number of steps N tends to infinity, the true worst-case predicted by Conjecture[2] 
for the same step size asymptotically tends to , which is exactly the same as (JToJ) . Indeed, one 
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can check that this rate is equal to the second term appearing in the max of Conjecture [5J while the first 
term tends to ) 2N which is always smaller. 

One can however do better using the optimal step size h op t- Since it is not closed-form, we use the 
following approximate expression obtained after solving a suitable approximation of equation © 


hopt(JV) 


1 

1 + K2N 
1 + K 1+ 2V 


(note that h op t(N) tends to as N grows), and find that Conjecture [ 2 ] predicts 
tending to 


LR 2 

2 


K !+ re 


1 — K 
1 + K 


2 N 


a worst-case bound 


which improves the asymptotic rate by a factor 



(which can be shown to lie between and -). 


4.1.3 A conjecture on the gradient norm 


We now consider a different performance criterion, given by the norm of the gradient computed at the last 
iterate. Numerical experiments with our formulation suggest that results similar to those presented in the 
previous sections can be obtained both in the smooth convex and smooth strongly convex cases, based again 
on one-dimensional piecewise quadratic worst-case functions. Using the same definition for functions ,/j , r 
and / 2 , and choosing now the parameter r according to 


Rk 

(re — 1 ) + (1 — kK)~ n ’ 


( 11 ) 


we propose the following conjecture. 

Conjecture 3 Any sequence of iterates {xi} generated by the gradient method GM with constant normalized 
step sizes 0 < h < 2 on a smooth strongly convex function / € (R fi ) satisfies 

|V/(xAr)|| a < Mmax ( _ 1} + ( “ _ Kh) - N ■ |1 " ^ 


As for Conjecture [5J we limit our numerical validation to the cases where the worst-case values predicted 
by the Conjecture are larger than 10 —6 ; the largest relative error is about 10 -7 . 

We note that, as re tends to zero (i.e., the smooth case), Conjecture [ 3 ] tends to 

l|V/(2qv)|| 2 < LRmax p |1 “ ■ 


From that, we see that the optimal step size h^ pt (N : 0) for the GM is again an increasing function of N with 
y/2 < h^ pt (N,0) < 2 and h^ pt (N,0) —> 2 as N —> 00 . In the strongly convex case k > 0, the optimal step 
size is a decreasing function of k and satisfies h^ pt (N, n) — > 1 as re —> 1. As in the previous case, h^ pt (N, k) 
is bounded above by which we can confirm with the following lower and upper bounds on h^ pt : 


1 + 


re — 1 


+ 


(£3 


-l/N 


< ho pi (N, k) < min 1 + 


« - 1 , 1 /, \—N 

- + - 1 - re) 

re re 


-l/N 


In the smooth case, those bounds reduce to the simpler expression 


2 - 


log 2IV 
N 


1 + (1 + 2 N)~ 1/n < h* pt < 1 + (1 + N)~ 1/n ~ 2 - 


log N 

N ' 
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We now compare with a standard analytical worst-case bound. The iterates of the GM method with nor¬ 
malized step size are known to satisfy 


||ajjv - x*|| 2 < R 


1 — K 
1 + K 


and ||V/(xjv)|| 2 < L\\xn — x*\\ 2 < LR 


1 — K 
1 + K 


( 12 ) 


(see for example [21] for the left inequality, and use the L-Lipschitz property of the gradient along with 
V/(a;*) = 0 to derive the right inequality). The latter estimate is tight according to Conjecture |3j Using the 
following approximate optimal step size 


hJ pt (N) 


1_ 

1 + KN 
1 + K 1 + ^ 


(which tends to as N grows) can be shown to improve the conjectured asymptotic rate by the same 
_ 1 

factor k *+« as in the previous section. 


4.2 Fast gradient method and optimized gradient method 

In this section we assess the performance in the smooth convex case (y, = 0) of two accelerated first-order 
methods: the so-called fast gradient method (FGM) due to Nesterov [20], and an optimized gradient method 
(OGM) recently proposed by Kim and Fessler [14] . 


Fast Gradient Method (FGM) 

Input: / G J r 0 ,L(R d ), xo G R d , yo = xo, do = 1. 

For i = 0 : IV — 1 

Vi +1 =Xi- 

a i+ yWTi 

9i+1 ~ -2- 

Oi-l, 

Xi +1 = Vi+i + — a - (Vi+i ~ Vi) 

Vi +1 


Optimized Gradient Method (OGM) 

Input: / G J r o,L(R d ), xo G R d , yo = xo, 9 0 = 1. 
For * = 0 : IV — 1 

Vi+ 1 =Xi - j r Vf(Xi) 


i +1 = 


1+ V4fl? +1 , j< N _2 

i+y/ae*+i N - i 


9i — 1 , . 9i . . 

x i+ i = y i+ 1 -I- — -(j/i+i - Vi) + o -(2/i+i - Xi) 

Vi+l 1 


Both of these algorithms are defined in terms of two sequences: {yi} i is a primary sequence, and {xi} i is 
a secondary sequence, where the gradient is evaluated. We first show that both of these algorithms can be 
expressed as fixed-step first-order methods, which we defined as 


2 — 1 

Xi = x o - ^ ^z,fcV/(xfc) (for L = 1). 

k =0 
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One way to proceed is to focus on the secondary sequence {xi} i and substitute the yi s in the algorithm 
formulation. For FGM, we have 


9i , Qi ~ 1 
Xi-\-l — Xi —— “h —— 

L t>i +1 


... , _.H' 4 ■)' 

•*->1 J-'t—l j I j 


i ). 


= Xi + 


- 1 


i+1 


(Xi - Xi-i) - 


-1 


e , 


+i 


+ 1 >f + 


i - 1 9i -l 
9 l+ i L ’ 


which allows to obtain the step sizes relative to Xg by recurrence: 




L , Bi-1 

hi ’ k + !)RC 

hi- 

-1 ,k) 

if k 

< i — 

2, 

' h *> k + 

{hi,k 1) 


if k 

= i - 

1, 

iP 1 + 1 



if k 

= i, 








= 0 if k < 0 

and = 

-- 0 for 

all k. 

Similarly, we have for OGM 

( h 1 9i-l 

+ ~e~Ci 

hi- 

-i.fc) 

if k 

< i — 

2, 

‘ hi ’ k + 

{hi,k 1) 


if k 

= i - 

1, 

f- 1 +1 

Vi + 1 



if k 

= i, 



with the same initial conditions. This approach will provide estimates for the last secondary iterate xjv- If 
an estimate for last primary iterate y n is needed, one just has to replace the expression of xn by j/jv, which 
is done by using the following alternative coefficients for the last step: 


_ / h N - i ife if k < N - 2, 
hN ’ k ~ { 1 if k = TV — l t 

for both FGM and OGM. 

Again, our numerical experiments strongly suggest the same assumption about the shape of the worst- 
case functions, i.e., one-dimensional and piecewise quadratic (with iterates staying in the affine zone of /i, T )- 
Using this property, we are able to compute the following values of r achieving the worst-case final objective 
accuracy, which surprisingly hold for both the classical FGM and the more recent OGM (a coincidence for 
which we can offer no explanation) 

R R 

t\ = - 75 - for the primary sequence, T 2 = - N _ l - for the secondary sequence. 

t ^‘^2/k =0 h'N— l,fc T 3 ‘^ J ^2/k =0 ^N,k 1 

Our numerical results suggest the following two conjectures (validations for both conjectures were performed 
for values of N G {1,..., 100} and displayed a relative error less than 10 -4 ). 

Conjecture 4 Any (primary) sequence of iterates {yi} generated by the fast gradient methd FGM (resp. 
optimized gradient method OGM) on a smooth convex function / G satisfies 

KVn ) - /* < fl , rAyi , N ) = JV-2 , 1 -— > 

Z 2^k =0 n N-l,k+o 

where yi,N is the final (primary) iterate computed by FGM (resp. OGM) applied to fi, Tl starting from 
xg = R, and quantities liN-i,k are the fixed coefficients of the last step of FGM (resp. OGM). 


Conjecture 5 Any (secondary) sequence of iterates {xi} generated by the fast gradient methd FGM (resp. 
optimized gradient method OGM) on a smooth convex function / G satisfies 

f(x N ) ~ /* < fi, T2 (xi,N) = N _ ± \ -—, 

where aq^jv is the final (secondary) iterate computed by FGM (resp. OGM) applied to /i,t 2 starting from 
Xg = R, and quantities hjv.fc are the fixed coefficients of the last step of FGM (resp. OGM). 
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The worst-case bounds in these two conjectures involve the normalized step sizes of the FGM and OGM 
methods. It turns out these can be computed in closed form for OGM (see also [13]), and give (N > 1) 


/(2/Jv) - /* < 


LR 

40jv-i 


2 

- < 

+ 2 - 


LR 2 2 , x , ^ LR 2 ^ LR 2 2 

2 (N + l) 2 + 2 an ~ 20% ~ 2 (N + 1)(N + 1 + y/2) 


(where the inequalities rely on the bounds 0%-_i > and 9% > ( jv + 1 )( J ^+ 1 +v / ^) y We were not able to 

obtain similar closed-form bounds for the FGM. 

We now compare the numerical values obtained with Conjectures [4] and [5] with analytical bounds known 
for the FGM. We use for the primary sequence 


f(VN) - f*< 


2 LR 2 


(IV+1)2’ 

which can be found in [3], and for the secondary sequence 


f{x n) - /* < 


2LR 
(N + 2)2 


(13) 


(14) 


which was very recently derived in m. The comparison is displayed on Fig. [4j The asymptotic behaviors of 
both sequences are well captured by the analytical bounds (131 and (14), but we observe that the estimation 


of the transient worst cases are improved by our conjectures: a factor approximately equal to 1.15 is gained 
for both sequences after 30 iterations. 

Before going into the next section, we comment on the applicability of our results to monotone variants 
of first-order methods, i.e. methods which guarantee f(%n+ i) < f{yi)- Consider for example FISTA 3], which 
is equivalent to FGM when applied to smooth unconstrained minimization. MFISTA [3], a monotone variant 
of FISTA, happens to generate a monotonically decreasing sequence {/(yi))i when applied to our worst-case 
function fi, Tl from xo = R. This means that the corresponding lower bound from Conjecture 4 also applies 
to MFISTA. 



versus Conjecture 


0 
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4.3 Estimation of the smallest gradient norm among all iterates 


First-order methods are often used in dual approaches where, in addition to objective function accuracy, 
gradient norm plays an important role. Indeed, this quantity controls primal feasibility of the iterates (see 
e.g., 0). Considering for example the accelerated FGM in the smooth case, we know from the previous 
section that the classical analytical bound on the worst-case accuracy for a function in is given by 

(N+iy ■ F rom that bound, it is easy to obtain a similar bound on the last gradient norm, using Corollary |TJ 

\\Vf(y N )\\ 2 < \/2 L(f(y N ) - /*) < (15) 

Observe that this asymptotic rate is significantly worse than that of the objective function accuracy, and 
not better than that of the gradient method GM (see Conjecture [3]). 

However, it is well-known that the norm of the gradient is not decreasing monotonically among iterates 
of the FGM. Hence, in this section, we will estimate the worst-case performance of FGM according to the 
smallest observed gradient norm among all iterates: 


ig{0,...,JV} 


In order to do so, only a slightly modified version of (sdp-PEP) is needed: this min-type objective function is 


representable using a new variable t for the objective and N +1 additional linear inequalities t < || V/(j/i) || ^ 
t < Gi,i for all 0 < i < TV. Note that the maximum is still attained since this concave piecewise linear 
objective function is continuous. 

This criterion was suggested in (22], which proposes a variant of FGM that consists in performing N/2 
steps of the standard FGM followed by N/2 steps of the GM with h — 1. It is then theoretically established 
that this variant of FGM, which we denote by MFGM, satisfies 




SLR 


(16) 


an improvement compared to the rate of convergence of the gradient of the last iterate. 

We now compare FGM with this modified variant MFGM using our performance estimation formulation. 
Fig. [5]compares the behaviors of those methods in both their last (for FGM) and best iterates, as well as the 
above analytic bounds (15) and (161. This experiment confirms that the gradient norm of the last iterate 
of FGM decreases according to the slower ©(IV -1 ) rate of 


151). We also observe that both the MFGM and 

o / A 

the original FGM achieve the same 0{N~ ' ) convergence rate for the smallest gradient norm, which was 
not known before for FGM. In addition, numerical results reported in Table [4] suggest that FGM performs 
slightly better than MFGM. 


N 

FGM, analytic 

EL 

FGM, last 

FGM, best 

MFGM, analytic |l6| 

MFGM, best 

2 

LR/1.50 


LR/ 3.00 

LR/ 3.00 

LR/ 0.35 

LR/ 3.00 

4 

LR/ 2.50 


LR/5.84 

LR/ 5.84 

LR/ 1.00 

LR/ 5.00 

10 

LR/ 5.50 


LR/15.14 

LR/15.62 

LR/ 3.95 

LR/12.66 

20 

LR/10.50 


LR/ 25.08 

LR/34.49 

LR/11.18 

LR/ 30.77 

30 

LR/15.50 


LR/ 35.13 

LR/58.50 

LR/ 20.54 

LR/ 55.38 

40 

LR/ 20.50 


LR/45.19 

LR/86.17 

LR/31.62 

LR/86.41 

50 

LR/ 25.50 


LR/ 55.25 

LR/117.08 

LR/44.19 

LR/119.63 

100 

LR/ 50.50 


LR/105.49 

L.R/311.34 

LR/125.00 

LR/296.58 

200 

LR/100.50 


LR/ 205.77 

LR/850.59 

LR/ 353.55 

L.R/791.87 


Table 4 FGM and MFGM: comparison between theoretical bounds and numerical results for the criteria 11V T(yisr )IIo(last) 
and min, ||V/(yi)|| 2 (best). Results obtained with m ■ 
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Iteration count (log scale) 


Fig. 5 Comparison of gradient nor m co nvergence rates for the FGM and the MFGM from m■ Theoretical guarantees are 
dashed. Analytical bound on FGM ( |15| ) in its last iterate (dashed blue); numerical worst- case for FGM at its last iterate 
(blue); numerical worst-case for FGM at its best iterate (red); analytical bound on MFGM i ] 1 6[ i for the best iterate (dashed 
black); numerical worst-case for MFGM at its best iterate (black). 


A regularization technique is also described in [22], featuring a 0(N~ 2 ) convergence rate up to a log¬ 
arithmic factor. A drawback of this approach is that it requires a bound on the distance to the optimal 
solution, and that the coefficients of the method explicitly depend on this bound. No fixed-step method 
achieving the same 0(N~ 2 ) seems to be known. 


5 Conclusion 


The contribution of this paper is threefold: first, we present necessary and sufficient conditions for smooth 
strongly convex interpolation. Those conditions are derived by showing an explicit way of constructing 
the interpolating functions. Second, we show that the exact worst-case performance of any fixed-step first- 
order algorithm for smooth strongly convex unconstrained optimization can be formulated as a convex 
problem. In this context, our interpolation procedure also provides explicit functions achieving the worst-case 
bounds computed by our approach. Third, we test of our formulation numerically on a variety of functions 
classes, first-order methods and performance criteria, establishing on the way a series of conjectures on the 
corresponding worst-case behaviors. In particular, we suggest new tight estimates of the optimal step size 
for the fixed-step gradient method with constant step size, which depend on the number of iterations and 
the condition number. 


Our performance estimation problem provide a generic tool to analyze fixed-step first-order methods. 
It allows computing both exact worst-case guarantees and functions reaching them, and provides a unified 
algorithmic analysis for smooth convex functions and smooth strongly convex functions. 

The exact worst-case values provided by our approach require solving a convex semidefinite program 
whose size grows as the square of the number of iterations considered, which may become prohibitive when 
this number of iterations is large. This can be avoided using iteration-independent bounds, as proposed 
in [H], but at the cost of obtaining poorer worst-case guarantees. 

Further improvements to our approach include an extension of (PEP) to more general methods, such as 
first-order methods equipped with line search, or first-order methods designed to work on a restricted convex 
feasible region (projected gradient). Another desirable feature would be the ability to optimize the step sizes 
of the method considered in (sdp-PEP I, as was proposed in mm for the relaxed version of (|PEP[). 
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Software. Our semidefinite programming approach to performance estimation has been implemented 
with MATLAB code, which can be downloaded from http://perso.uclouvain.be/adrien.taylor. This 
routine features an easy-to-use interface, which allows the estimation of the worst-case performance of a given 
fixed-step first-order algorithm (to be chosen among a pre-defined list or to be specified by its coefficients) 
on a given class of functions, for a given performance criterion, after any number of steps. 
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